Open Models
So Long, GPT-5. Hello, Qwen
In the AI boom, chatbots and GPTs come and go quickly. On a drizzly and windswept afternoon this summer, I visited the headquarters of Rokid, a startup developing smart glasses in Hangzhou, China. As I chatted with engineers, their words were swiftly translated from Mandarin to English, and then transcribed onto a tiny translucent screen just above my right eye using one of the company's new prototype devices. Rokid's high-tech spectacles use Qwen, an open-weight large language model developed by the Chinese ecommerce giant Alibaba. OpenAI's GPT-5, Google's Gemini 3, and Anthropic's Claude often score higher on benchmarks designed to gauge different dimensions of machine cleverness.
- Asia > China > Zhejiang Province > Hangzhou (0.25)
- North America > United States > Michigan (0.05)
- North America > United States > California (0.05)
- (2 more...)
Nvidia Becomes a Major Model Maker With Nemotron 3
The world's top chipmaker wants open source AI to succeed, perhaps because closed models increasingly run on its rivals' silicon. Nvidia has made a fortune supplying chips to companies working on artificial intelligence, but today the chipmaker took a step toward becoming a more serious model maker itself by releasing a series of cutting-edge open models, along with data and tools to help engineers use them. The move, which comes at a moment when AI companies like OpenAI, Google, and Anthropic are developing increasingly capable chips of their own, could be a hedge against these firms veering away from Nvidia's technology over time. Open models are already a crucial part of the AI ecosystem, with many researchers and startups using them to experiment, prototype, and build.
- North America > United States > District of Columbia > Washington (0.25)
- Asia > China (0.19)
- North America > United States > California (0.05)
- (2 more...)
The US Needs an Open Source AI Intervention to Beat China
Depending on foreign-made open models is both a supply chain risk and an innovation problem, experts say. Since 2022, America has had a solid lead in artificial intelligence thanks to advanced models from high-flying companies like OpenAI, Google DeepMind, Anthropic, and xAI. A growing number of experts, however, worry that the US is starting to fall behind when it comes to minting open-weight AI models that can be downloaded, adapted, and run locally. Open models from Chinese companies like Kimi, Z.ai, Alibaba, and DeepSeek are now rapidly gaining popularity among researchers and engineers worldwide, leaving the US as a laggard in an increasingly vital area of AI innovation. "The US needs open models to cement its lead at every level of the AI stack," Nathan Lambert, founder of the ATOM (American Truly Open Models) Project, tells WIRED.
- Asia > China (0.44)
- North America > United States > Washington > King County > Seattle (0.05)
- North America > United States > California (0.05)
- (2 more...)
Evaluating Modern Large Language Models on Low-Resource and Morphologically Rich Languages: A Cross-Lingual Benchmark Across Cantonese, Japanese, and Turkish
Xia, Chengxuan, Wu, Qianye, Guan, Hongbin, Tian, Sixuan, Hao, Yilun, Wu, Xiaoyu
Large language models (LLMs) have achieved impressive results in high-resource languages like English, yet their effectiveness in low-resource and morphologically rich languages remains underexplored. In this paper, we present a comprehensive evaluation of seven cutting-edge LLMs, including GPT-4o, GPT-4, Claude 3.5 Sonnet, LLaMA 3.1, Mistral Large 2, LLaMA-2 Chat 13B, and Mistral 7B Instruct, on a new cross-lingual benchmark covering Cantonese, Japanese, and Turkish. Our benchmark spans four diverse tasks: open-domain question answering, document summarization, English-to-X translation, and culturally grounded dialogue. We combine human evaluations (rating fluency, factual accuracy, and cultural appropriateness) with automated metrics (e.g., BLEU, ROUGE) to assess model performance. Our results reveal that while the largest proprietary models (GPT-4o, GPT-4, Claude 3.5) generally lead across languages and tasks, significant gaps persist in culturally nuanced understanding and morphological generalization. Notably, GPT-4o demonstrates robust multilingual performance even on cross-lingual tasks, and Claude 3.5 Sonnet achieves competitive accuracy on knowledge and reasoning benchmarks. However, all models struggle to some extent with the unique linguistic challenges of each language, such as Turkish agglutinative morphology and Cantonese colloquialisms. Smaller open-source models (LLaMA-2 13B, Mistral 7B) lag substantially in fluency and accuracy, highlighting the resource disparity. We provide detailed quantitative results and qualitative error analysis, and discuss implications for developing more culturally aware and linguistically generalizable LLMs. Our benchmark and evaluation data are released to foster reproducibility and further research.
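The automated metrics the abstract mentions reward n-gram overlap with a reference text. A minimal sketch of the idea, using a toy ROUGE-1 recall in pure Python (illustrative only, not the paper's evaluation code):

```python
from collections import Counter

def rouge1_recall(reference: str, candidate: str) -> float:
    """Toy ROUGE-1 recall: fraction of reference unigrams that
    also appear in the candidate, with clipped counts."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum(min(n, cand[w]) for w, n in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

# 5 of the 6 reference unigram tokens recur in the candidate
score = rouge1_recall("the cat sat on the mat", "the cat is on the mat")
```

A real evaluation would use a maintained implementation (e.g., the sacrebleu or rouge-score packages) with proper tokenization rather than a hand-rolled metric.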
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
- North America > United States > California > Santa Cruz County > Santa Cruz (0.14)
- Asia > Middle East > Republic of Türkiye (0.04)
- (4 more...)
OpenAI's Open-Weight Models Are Coming to the US Military
The gpt-oss models are being tested for use on sensitive military computers. But some defense insiders say that OpenAI is still behind the competition. When OpenAI unveiled its first open-weight models in years this August, it wasn't just tech companies that were paying attention. The release also excited US military and defense contractors, which saw a chance to use them for highly secure operations. Initial results show that OpenAI's tools lag behind competitors in desired capabilities, some military vendors tell WIRED.
- North America > United States > Illinois > Cook County > Chicago (0.05)
- North America > United States > Virginia > Arlington County > Arlington (0.04)
- North America > United States > California > Los Angeles County > Los Angeles (0.04)
- (4 more...)
- Government > Regional Government > North America Government > United States Government (1.00)
- Government > Military (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (1.00)
Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants
Bhatti, Hunzalah Hassan, Alam, Firoj
Large Language Models (LLMs) are increasingly used to answer everyday questions, yet their performance on culturally grounded and dialectal content remains uneven across languages. We propose a comprehensive method that (i) translates Modern Standard Arabic (MSA) multiple-choice questions (MCQs) into English and several Arabic dialects, (ii) converts them into open-ended questions (OEQs), (iii) benchmarks a range of zero-shot and fine-tuned LLMs under both MCQ and OEQ settings, and (iv) generates chain-of-thought (CoT) rationales to fine-tune models for step-by-step reasoning. Using this method, we extend an existing dataset in which QAs are aligned in parallel across multiple language varieties, making it, to our knowledge, the first of its kind. We conduct extensive experiments with both open and closed models. Our findings show that (i) models underperform on Arabic dialects, revealing persistent gaps in culturally grounded and dialect-specific knowledge; (ii) Arabic-centric models perform well on MCQs but struggle with OEQs; and (iii) CoT improves judged correctness while yielding mixed n-gram-based metrics. The developed dataset will be publicly released to support further research on culturally and linguistically inclusive evaluation.
- Europe > Austria > Vienna (0.15)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Europe > Switzerland > Basel-City > Basel (0.04)
- (10 more...)
Extracting alignment data in open models
Barbero, Federico, Gu, Xiangming, Choquette-Choo, Christopher A., Sitawarin, Chawin, Jagielski, Matthew, Yona, Itay, Veličković, Petar, Shumailov, Ilia, Hayes, Jamie
In this work, we show that it is possible to extract significant amounts of alignment training data from a post-trained model, data that is useful for steering the model to improve certain capabilities such as long-context reasoning, safety, instruction following, and maths. While the majority of related work on memorisation has focused on measuring the success of training data extraction through string matching, we argue that embedding models are better suited for our specific goals. Distances measured through a high-quality embedding model can identify semantic similarities between strings that a different metric such as edit distance will struggle to capture. In fact, in our investigation, approximate string matching would have severely undercounted (by a conservative estimate of 10×) the amount of data that can be extracted, due to trivial artifacts that deflate the metric. Interestingly, we find that models readily regurgitate training data that was used in post-training phases such as SFT or RL. We show that this data can then be used to train a base model, recovering a meaningful amount of the original performance. We believe our work exposes a possibly overlooked risk of extracting alignment data. Finally, our work opens up an interesting discussion on the downstream effects of distillation practices: since models seem to be regurgitating aspects of their training set, distillation can therefore be thought of as indirectly training on the model's original dataset.
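The abstract's core measurement point, that character-level matching can miss what a semantic comparison catches, can be sketched with two toy similarity functions. This is a hedged illustration: the bag-of-words cosine here stands in for a real neural embedding model, and none of these names come from the paper's code.

```python
import difflib
import math
from collections import Counter

def cosine_bow(a: str, b: str) -> float:
    """Toy semantic similarity: cosine over bag-of-words counts.
    A real setup would compare neural embedding vectors instead."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def string_match(a: str, b: str) -> float:
    """Character-level similarity, in the spirit of edit-distance
    style approximate string matching."""
    return difflib.SequenceMatcher(None, a, b).ratio()
```

The comparison logic mirrors the paper's argument: a candidate extraction is counted as a match when its similarity to a training string clears a threshold, and the choice of similarity function decides how many matches survive trivial formatting artifacts.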
- Asia > Middle East > Jordan (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Asia > Singapore (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.91)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)
MultiFinBen: Benchmarking Large Language Models for Multilingual and Multimodal Financial Application
Peng, Xueqing, Qian, Lingfei, Wang, Yan, Xiang, Ruoyu, He, Yueru, Ren, Yang, Jiang, Mingyang, Zhang, Vincent Jim, Guo, Yuqing, Zhao, Jeff, He, Huan, Han, Yi, Feng, Yun, Jiang, Yuechen, Cao, Yupeng, Li, Haohang, Yu, Yangyang, Wang, Xiaoyu, Gao, Penglei, Lin, Shengyuan, Wang, Keyi, Yang, Shanshan, Zhao, Yilun, Liu, Zhiwei, Lu, Peng, Huang, Jerry, Wang, Suyuchen, Papadopoulos, Triantafillos, Giannouris, Polydoros, Soufleri, Efstathia, Chen, Nuo, Deng, Zhiyang, Fu, Heming, Zhao, Yijia, Lin, Mingquan, Qiu, Meikang, Smith, Kaleb E, Cohan, Arman, Liu, Xiao-Yang, Huang, Jimin, Xiong, Guojun, Lopez-Lira, Alejandro, Chen, Xi, Tsujii, Junichi, Nie, Jian-Yun, Ananiadou, Sophia, Xie, Qianqian
Real-world financial analysis involves information across multiple languages and modalities, from reports and news to scanned filings and meeting recordings. Yet most existing evaluations of LLMs in finance remain text-only, monolingual, and largely saturated by current models. To bridge these gaps, we present MultiFinBen, the first expert-annotated multilingual (five languages) and multimodal (text, vision, audio) benchmark for evaluating LLMs in realistic financial contexts. MultiFinBen introduces two new task families: multilingual financial reasoning, which tests cross-lingual evidence integration from filings and news, and financial OCR, which extracts structured text from scanned documents containing tables and charts. Rather than aggregating all available datasets, we apply a structured, difficulty-aware selection based on advanced model performance, ensuring balanced challenge and removing redundant tasks. Evaluating 21 leading LLMs shows that even frontier multimodal models like GPT-4o achieve only 46.01% overall, stronger on vision and audio but dropping sharply in multilingual settings. These findings expose persistent limitations in multilingual, multimodal, and expert-level financial reasoning. All datasets, evaluation scripts, and leaderboards are publicly released.
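The "structured, difficulty-aware selection based on advanced model performance" described above can be sketched as a simple accuracy-band filter: drop tasks that frontier models already saturate and tasks too hard to discriminate between systems. The thresholds and names below are illustrative assumptions, not the paper's actual selection procedure.

```python
def select_tasks(task_acc: dict[str, float],
                 lo: float = 0.2, hi: float = 0.8) -> list[str]:
    """Keep only tasks whose frontier-model accuracy falls in a
    mid band: not saturated (acc > hi), not hopeless (acc < lo)."""
    return sorted(t for t, a in task_acc.items() if lo <= a <= hi)

kept = select_tasks({"financial_ocr": 0.5, "easy_qa": 0.95, "noisy_audio": 0.1})
```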
- Europe > Austria > Vienna (0.14)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Asia > Middle East > Jordan (0.04)
- (12 more...)
- Financial News (1.00)
- Research Report > New Finding (0.45)
- Government (1.00)
- Banking & Finance > Trading (1.00)
- Information Technology > Security & Privacy (0.92)
- (2 more...)
Leveraging LLMs for Semi-Automatic Corpus Filtration in Systematic Literature Reviews
Joos, Lucas, Keim, Daniel A., Fischer, Maximilian T.
The creation of systematic literature reviews (SLR) is critical for analyzing the landscape of a research field and guiding future research directions. However, retrieving and filtering the literature corpus for an SLR is highly time-consuming and requires extensive manual effort, as keyword-based searches in digital libraries often return numerous irrelevant publications. In this work, we propose a pipeline leveraging multiple large language models (LLMs), classifying papers based on descriptive prompts and deciding jointly using a consensus scheme. The entire process is human-supervised and interactively controlled via our open-source visual analytics web interface, LLMSurver, which enables real-time inspection and modification of model outputs. We evaluate our approach using ground-truth data from a recent SLR comprising over 8,000 candidate papers, benchmarking both open and commercial state-of-the-art LLMs from mid-2024 and fall 2025. Results demonstrate that our pipeline significantly reduces manual effort while achieving lower error rates than single human annotators. Furthermore, modern open-source models prove sufficient for this task, making the method accessible and cost-effective. Overall, our work demonstrates how responsible human-AI collaboration can accelerate and enhance systematic literature reviews within academic workflows.
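The joint decision via a consensus scheme described above can be sketched as a vote over per-model labels. This is a minimal sketch assuming a majority-style rule and a binary relevant/irrelevant label set; the paper's actual aggregation may differ.

```python
from collections import Counter

def consensus(votes: list[str], threshold: float = 0.5) -> str:
    """Keep a candidate paper only when the share of 'relevant'
    votes across the LLM classifiers exceeds the threshold."""
    share = Counter(votes)["relevant"] / len(votes)
    return "relevant" if share > threshold else "irrelevant"

# Two of three models agree the paper belongs in the corpus
decision = consensus(["relevant", "relevant", "irrelevant"])
```

In the human-supervised setting the paper describes, borderline shares near the threshold would be the natural cases to surface for manual review in the LLMSurver interface.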
- Research Report > New Finding (1.00)
- Overview (1.00)
- Health & Medicine (1.00)
- Information Technology > Security & Privacy (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Challenges and Applications of Large Language Models: A Comparison of GPT and DeepSeek family of models
Sharma, Shubham, Tuli, Sneha, Badam, Narendra
Large Language Models (LLMs) are transforming AI across industries, but their development and deployment remain complex. This survey reviews 16 key challenges in building and using LLMs and examines how these challenges are addressed by two state-of-the-art models with unique approaches: OpenAI's closed source GPT-4o (May 2024 update) and DeepSeek-V3-0324 (March 2025), a large open source Mixture-of-Experts model. Through this comparison, we showcase the trade-offs between closed source models (robust safety, fine-tuned reliability) and open source models (efficiency, adaptability). We also explore LLM applications across different domains (from chatbots and coding tools to healthcare and education), highlighting which model attributes are best suited for each use case. This article aims to guide AI researchers, developers, and decision-makers in understanding current LLM capabilities, limitations, and best practices.
- Overview (1.00)
- Research Report > Promising Solution (0.34)
- Health & Medicine (1.00)
- Education (1.00)
- Law (0.93)